TRE (computing)

TRE is an open-source library for text search that works like a regular expression engine with the added ability to perform fuzzy string searching. It is developed by Ville Laurikari and released under a 2-clause BSD-like license.

The library is written in C and provides functions for searching lines of input text with regular expressions. Its main difference from other regular expression engines is that TRE can match text fragments approximately, that is, allowing for a certain number of typos in the text.

Features

Approximate matching

TRE uses extended regular expression syntax, with additional directives for matching the preceding fragment approximately. Each such directive specifies how many typos are allowed for that fragment.

Approximate matching is performed in a way similar to Levenshtein distance, which means that three types of typos are recognized[1]: insertion of an extra character, deletion of a character, and substitution of one character for another.

TRE allows the cost of each of the three typo types to be specified independently.
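The effect of independent costs can be sketched with a weighted variant of the Levenshtein distance. The code below is only an illustration under stated assumptions: the names `edit_costs` and `weighted_distance` are hypothetical and are not part of the TRE API, and the function compares two whole strings rather than searching with a regular expression.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical type for illustration only; not part of the TRE API. */
typedef struct {
    int ins;   /* cost of an extra character in the text */
    int del;   /* cost of a pattern character missing from the text */
    int subst; /* cost of one character replaced by another */
} edit_costs;

static int min3(int a, int b, int c) {
    int m = a < b ? a : b;
    return m < c ? m : c;
}

/* Weighted edit distance needed to turn `pattern` into `text`.
   Only two rows of the dynamic-programming table are kept.
   Returns -1 if memory cannot be allocated. */
int weighted_distance(const char *pattern, const char *text, edit_costs c) {
    assert(pattern != NULL && text != NULL);
    size_t m = strlen(pattern), n = strlen(text);
    int *prev = malloc((n + 1) * sizeof *prev);
    int *cur = malloc((n + 1) * sizeof *cur);
    if (prev == NULL || cur == NULL) {
        free(prev);
        free(cur);
        return -1;
    }

    /* Empty pattern: every text character is an insertion. */
    for (size_t j = 0; j <= n; j++) prev[j] = (int)j * c.ins;

    for (size_t i = 1; i <= m; i++) {
        /* Empty text: every pattern character is a deletion. */
        cur[0] = (int)i * c.del;
        for (size_t j = 1; j <= n; j++) {
            int same = pattern[i - 1] == text[j - 1];
            cur[j] = min3(prev[j] + c.del,
                          cur[j - 1] + c.ins,
                          prev[j - 1] + (same ? 0 : c.subst));
        }
        int *tmp = prev; prev = cur; cur = tmp;
    }

    int result = prev[n];
    free(prev);
    free(cur);
    return result;
}
```

With all costs set to 1, comparing "regular" with "rogular" gives a distance of 1 (one substitution). If the substitution cost is raised to 3 while insertion and deletion stay at 1, the distance becomes 2, because deleting one character and inserting another is then cheaper than substituting.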

Command-line utility

The project comes with a command-line utility (a version of agrep) that is built automatically with the library. It can be used for processing text files or for testing the capabilities of TRE.

Standard conformance

Although approximate matching requires some syntax extensions, when this feature is not used TRE works like most other regular expression engines. This means that

Predictable time and memory consumption

The author states[2] that matching time grows linearly with the length of the input text, while memory requirements are almost constant (tens of kilobytes). This is important, especially for possible use in embedded systems, which have comparatively few resources.

No information is available on benchmarks against other regular expression engines.
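To give a feel for how approximate search can run in time proportional to the text length with memory that does not depend on it, here is a minimal sketch of Sellers' dynamic-programming scan for a fixed literal pattern. This is not TRE's matching algorithm, which handles full regular expressions; the function name `best_substring_distance` is hypothetical, and all three typo types are given a cost of 1.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Smallest number of unit-cost edits needed to match `pattern` against
   any substring of `text`. Memory use is one column of strlen(pattern) + 1
   integers, regardless of the text length. Returns -1 on allocation failure. */
int best_substring_distance(const char *pattern, const char *text) {
    assert(pattern != NULL && text != NULL);
    size_t m = strlen(pattern);
    int *col = malloc((m + 1) * sizeof *col);
    if (col == NULL) return -1;

    /* Before any text is read, matching i pattern characters costs i deletions. */
    for (size_t i = 0; i <= m; i++) col[i] = (int)i;
    int best = col[m];

    /* One pass over the text; each character costs O(m) work. */
    for (const char *t = text; *t != '\0'; t++) {
        int diag = col[0]; /* previous column, row i - 1 */
        col[0] = 0;        /* a match may start at any text position for free */
        for (size_t i = 1; i <= m; i++) {
            int subst = diag + (pattern[i - 1] == *t ? 0 : 1);
            int ins = col[i] + 1;     /* previous column, same row */
            int del = col[i - 1] + 1; /* current column, row above */
            diag = col[i];
            int v = subst < ins ? subst : ins;
            col[i] = v < del ? v : del;
        }
        if (col[m] < best) best = col[m];
    }

    free(col);
    return best;
}
```

For example, searching for "regular" in "a rogular expression" yields 1, and searching for it in "regular expression" yields 0. The text is read once from left to right and only a single column is kept, so for a fixed pattern the running time grows linearly with the text.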

Other

Other features common to most regex engines can be found in regex engine comparison tables or in the list of TRE features on its web page.

Usage example

Approximate matching directives are specified in curly brackets and must be distinguishable from repetition quantifiers (if necessary, by inserting a space after the opening bracket):

Portability

C and C++

Being written in C, TRE can be ported (i.e. rebuilt) to any platform that has the GNU C Compiler installed. Because the sources are simple (20-30 files in a single folder), porting with more specialized compilers is probably easy as well.

Other languages

Projects written in languages other than C/C++ can use TRE in two ways:

Disadvantages

Since other regular expression engines usually do not provide approximate matching, there is almost no competing implementation with which TRE could be compared. However, there are a few things that programmers may wish to see implemented in future releases:

References

External links